Create a Machine Learning algorithm to identify possible business insights based on different information related to business reviews, extracted from an open data source of the Yelp page.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import json
import geopandas as gpd
from sklearn.preprocessing import OrdinalEncoder
import warnings
warnings.filterwarnings('ignore')
url_business="C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\yelp_academic_dataset_business.json"
df_business=pd.read_json(url_business,
lines=True,
orient='columns')
print(f'The business dataset contains {df_business.shape[0]} rows and {df_business.shape[1]} columns')
print('Here is a list of the first 5 rows of the dataset')
df_business.head(5)
The business dataset contains 150346 rows and 14 columns
Here is a list of the first 5 rows of the dataset
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | None |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
df_business.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   business_id   150346 non-null  object
 1   name          150346 non-null  object
 2   address       150346 non-null  object
 3   city          150346 non-null  object
 4   state         150346 non-null  object
 5   postal_code   150346 non-null  object
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64
 10  is_open       150346 non-null  int64
 11  attributes    136602 non-null  object
 12  categories    150243 non-null  object
 13  hours         127123 non-null  object
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB
- Information in the dataframe:
The complete business dataframe consists of 14 columns with information on each business that is part of Yelp. In this dataset we can find information about the different businesses on the Yelp platform: the location of the establishment, the name of the business, the type of business, among other information highly relevant to this analysis.
Because the dataset is very large, we are going to read it with the chunksize parameter, which loads the file in small batches of rows so we can select the data to work with. We used a chunksize of 1000 rows to handle the dataset.
# Create an empty list to collect the chunks that will form the dataset
chunks = []
size = 1000
# Specify dtypes
r_dtypes = {"stars": np.float16,
"useful": np.int32,
"funny": np.int32,
"cool": np.int32}
reader = pd.read_json("C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\yelp_academic_dataset_review.json",
lines=True,
chunksize=size,
dtype= r_dtypes,
orient='records')
for chunk in reader:
reduced_chunk = chunk.drop(columns=['review_id'])\
.query("`date` >= '2006-12-31'")
chunks.append(reduced_chunk)
# Concatenate each chunk into a single dataframe
df_reviews = pd.concat(chunks, ignore_index=True)
print(f'The size of the dataset is {df_reviews.shape[0]} rows, with {df_reviews.shape[1]} columns')
print('Here are the first 5 reviews')
df_reviews.head(5)
The size of the dataset is 6985581 rows, with 8 columns
Here are the first 5 reviews
| | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|
| 0 | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3.0 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5.0 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3.0 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5.0 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4.0 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
df_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6985581 entries, 0 to 6985580
Data columns (total 8 columns):
 #   Column       Dtype
---  ------       -----
 0   user_id      object
 1   business_id  object
 2   stars        float16
 3   useful       int32
 4   funny        int32
 5   cool         int32
 6   text         object
 7   date         datetime64[ns]
dtypes: datetime64[ns](1), float16(1), int32(3), object(3)
memory usage: 306.5+ MB
- Information in the dataframe:
The dataframe "review" contains the information about the different reviews and ratings that different users assign to the different business. It contains the stars given to different busineess, the user_id that placed the review and the date which the review was placed amoung other information. This dataset can also give a lot of insights on the reviews made by each user.
url_check= "C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\yelp_academic_dataset_checkin.json"
df_checkin=pd.read_json(url_check,
lines=True,
orient='columns')
print(f'The size of the dataset is {df_checkin.shape[0]} rows, with {df_checkin.shape[1]} columns')
print('Here are the first 5 check-in records')
df_checkin.head(5)
The size of the dataset is 131930 rows, with 2 columns
Here are the first 5 check-in records
| | business_id | date |
|---|---|---|
| 0 | ---kPU91CF4Lq2-WlRu9Lw | 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020... |
| 1 | --0iUa4sNDFiZFrAdIWhZQ | 2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011... |
| 2 | --30_8IhuyMHbSOcNWd6DQ | 2013-06-14 23:29:17, 2014-08-13 23:20:22 |
| 3 | --7PUidqRWpRSpXebiyxTg | 2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012... |
| 4 | --7jw19RH9JKXgFohspgQw | 2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014... |
df_checkin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131930 entries, 0 to 131929
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   business_id  131930 non-null  object
 1   date         131930 non-null  object
dtypes: object(2)
memory usage: 2.0+ MB
- Information in the dataframe:
This dataset only provides the business_id and the dates on which users checked in at that specific business on Yelp.
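Since the date column stores every check-in for a business as a single comma-separated string, a simple way to make it usable is to count the check-ins per business. A minimal sketch (the n_checkins column name is our own):
# Count the check-ins per business by splitting the comma-separated timestamps
df_checkin['n_checkins'] = df_checkin['date'].str.split(',').str.len()
df_checkin[['business_id', 'n_checkins']].head()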
Because this dataset is also very large, we again read it with the chunksize parameter, loading the file in small batches of rows. We used a chunksize of 1000 rows to handle the dataset.
# Create an empty list to collect the chunks that will form the dataset
chunks = []
size = 1000
reader_user = pd.read_json("C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\yelp_academic_dataset_user.json",
lines=True,
chunksize=size,
orient='columns')
for chunk in reader_user:
chunks.append(chunk)
df_user = pd.concat(chunks, ignore_index=True)
print(f'The size of the dataset is {df_user.shape[0]} rows, with {df_user.shape[1]} columns')
print('Here are the first 5 user records')
df_user.head(5)
The size of the dataset is 1987897 rows, with 22 columns
Here are the first 5 user records
| | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | qVc8ODYU5SZjKXVBgXdI7w | Walker | 585 | 2007-01-25 16:47:26 | 7217 | 1259 | 5994 | 2007 | NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA... | 267 | ... | 65 | 55 | 56 | 18 | 232 | 844 | 467 | 467 | 239 | 180 |
| 1 | j14WgRoU_-2ZE1aw1dXrJg | Daniel | 4333 | 2009-01-25 04:35:42 | 43091 | 13066 | 27281 | 2009,2010,2011,2012,2013,2014,2015,2016,2017,2... | ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A... | 3138 | ... | 264 | 184 | 157 | 251 | 1847 | 7054 | 3131 | 3131 | 1521 | 1946 |
| 2 | 2WnXYQFK0hXEoTxPtV2zvg | Steph | 665 | 2008-07-25 10:41:00 | 2086 | 1010 | 1003 | 2009,2010,2011,2012,2013 | LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA... | 52 | ... | 13 | 10 | 17 | 3 | 66 | 96 | 119 | 119 | 35 | 18 |
| 3 | SZDeASXq7o05mMNLshsdIA | Gwen | 224 | 2005-11-29 04:38:33 | 512 | 330 | 299 | 2009,2010,2011 | enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg... | 28 | ... | 4 | 1 | 6 | 2 | 12 | 16 | 26 | 26 | 10 | 9 |
| 4 | hA5lMy-EnncsH4JoR-hFGQ | Karen | 79 | 2007-01-05 19:40:59 | 29 | 15 | 7 | | PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA... | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
5 rows × 22 columns
df_user.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987897 entries, 0 to 1987896
Data columns (total 22 columns):
 #   Column              Dtype
---  ------              -----
 0   user_id             object
 1   name                object
 2   review_count        int64
 3   yelping_since       object
 4   useful              int64
 5   funny               int64
 6   cool                int64
 7   elite               object
 8   friends             object
 9   fans                int64
 10  average_stars       float64
 11  compliment_hot      int64
 12  compliment_more     int64
 13  compliment_profile  int64
 14  compliment_cute     int64
 15  compliment_list     int64
 16  compliment_note     int64
 17  compliment_plain    int64
 18  compliment_cool     int64
 19  compliment_funny    int64
 20  compliment_writer   int64
 21  compliment_photos   int64
dtypes: float64(1), int64(16), object(5)
memory usage: 333.7+ MB
- Information in the dataframe:
In this dataset, the focus is on the users giving reviews to the different businesses on the Yelp platform: their user_id, how many reviews they have placed, how long each user has been using the platform, their friends, among other information.
This dataset can be useful to evaluate and segment the different types/classes of users on the Yelp platform.
url_tip="C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\yelp_academic_dataset_tip.json"
df_tip=pd.read_json(url_tip,
lines=True,
orient='columns')
print(f'The size of the dataset is {df_tip.shape[0]} rows, with {df_tip.shape[1]} columns')
print('Here are the first 5 tips')
df_tip.head(5)
The size of the dataset is 908915 rows, with 5 columns
Here are the first 5 tips
| | user_id | business_id | text | date | compliment_count |
|---|---|---|---|---|---|
| 0 | AGNUgVwnZUey3gcPCJ76iw | 3uLgwr0qeCNMjKenHJwPGQ | Avengers time with the ladies. | 2012-05-18 02:17:21 | 0 |
| 1 | NBN4MgHP9D3cw--SnauTkA | QoezRbYQncpRqyrLH6Iqjg | They have lots of good deserts and tasty cuban... | 2013-02-05 18:35:10 | 0 |
| 2 | -copOvldyKh1qr-vzkDEvw | MYoRNLb5chwjQe3c_k37Gg | It's open even when you think it isn't | 2013-08-18 00:56:08 | 0 |
| 3 | FjMQVZjSqY8syIO-53KFKw | hV-bABTK-glh5wj31ps_Jw | Very decent fried chicken | 2017-06-27 23:05:38 | 0 |
| 4 | ld0AperBXk1h6UbqmM80zw | _uN0OudeJ3Zl_tf6nxg5ww | Appetizers.. platter special for lunch | 2012-10-06 19:43:09 | 0 |
df_tip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908915 entries, 0 to 908914
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   user_id           908915 non-null  object
 1   business_id       908915 non-null  object
 2   text              908915 non-null  object
 3   date              908915 non-null  datetime64[ns]
 4   compliment_count  908915 non-null  int64
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 34.7+ MB
- Information in the dataframe:
This dataset contains the tips, which are short reviews left for the different businesses.
We will now verify whether there are any missing or null values in the datasets we are going to keep working with.
# Verify the null values in the dataframe of business
null_col=df_business.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_business)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| | index | null_values | total_rows | miss_percentage |
|---|---|---|---|---|
| 0 | hours | 23223 | 150346 | 15.446 |
| 1 | attributes | 13744 | 150346 | 9.142 |
| 2 | categories | 103 | 150346 | 0.069 |
# Verify the null values in the dataframe of reviews
null_col=df_reviews.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_reviews)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| index | null_values | total_rows | miss_percentage |
|---|---|---|---|
# Verify the null values in the dataframe of checkin
null_col=df_checkin.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_checkin)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| index | null_values | total_rows | miss_percentage |
|---|---|---|---|
# Verify the null values in the dataframe of user
null_col=df_user.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_user)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| index | null_values | total_rows | miss_percentage |
|---|---|---|---|
# Verify the null values in the dataframe of tip
null_col=df_tip.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_tip)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| index | null_values | total_rows | miss_percentage |
|---|---|---|---|
Summary
The only dataset we will keep using that has missing values is the business dataset, with missing information in the hours (the hours the business is open), attributes, and categories variables.
We are going to handle the missing values as follows:
1. categories <- we drop the rows with missing values in categories, as they represent only 0.069% of the data.
2. hours & attributes <- we impute the missing values with a 'None' string to make explicit that information was missing in those cells, since more than 15% of the hours column is missing. As we don't know why the information is missing, we replace the NaN values while making clear that the original information was not recorded.
# Fill the missing values with 'None' in the hours and attributes columns
df_business['hours']=df_business['hours'].fillna('None')
df_business['attributes']=df_business['attributes'].fillna('None')
# Drop the rows with missing values in categories (the only column still containing NaN)
df_business.dropna(subset=['categories'], inplace=True)
# Verify the null values in the dataframe of business
null_col=df_business.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()
# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_business)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
| index | null_values | total_rows | miss_percentage |
|---|---|---|---|
df_business.shape
(150243, 14)
a. Explore the different kinds of business categories and the most common categories in the Yelp database
# Create the new dataframe to verify the most frequent categories
business_categ=(', '.join(df_business['categories'].dropna())) # join all the category strings into one string
business_categ=pd.DataFrame([cat.strip() for cat in business_categ.split(',')],columns=['category']) # split into rows, stripping the leading spaces so 'Food' and ' Food' are not counted separately
business_categ=business_categ.value_counts().to_frame().reset_index() # count the number of each category
business_categ.columns=['category','num_business'] # rename the columns
# Plot the created dataset in a graph to visualize better the information
plt.figure(figsize=(15,10))
ax=sns.barplot(data=business_categ.iloc[0:15], y= 'category', x='num_business', palette="crest")
ax.set_ylabel('Category')
ax.set_xlabel('Number of business')
ax.set_title('Top 15 Business Categories in Yelp Database')
# Add the value labels to each bar in the plot
for p in ax.patches:
ax.annotate(int(p.get_width()),
((p.get_x() + p.get_width()),
p.get_y()),
xytext=(1, -18),
fontsize=12,
color='black',
textcoords='offset points',
horizontalalignment='left')
plt.show()
The top business category on the Yelp platform is Restaurants, with 52268 different restaurants in the database. After Restaurants, Food is the category with the second-largest number of businesses. Knowing the different business categories and which ones are most frequent in the Yelp dataset helps identify which categories to focus on first in the analysis.
For example, if we want to identify the different kinds of customers that assign different star ratings to businesses, we can start with the Restaurants category, the leading category on the Yelp platform, as shown in the sketch below.
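As a starting point for that segmentation, we can keep only the businesses whose categories include Restaurants; a minimal sketch (df_restaurants is a name of our own):
# Keep only the businesses whose categories mention 'Restaurants'
df_restaurants = df_business[df_business['categories'].str.contains('Restaurants', na=False)]
print(df_restaurants.shape)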
b. Visualization of the Number of reviews by state
# Plot the graph to see the states with the most reviews written to different business
fig=px.choropleth(df_business[['state','review_count']].groupby('state').sum().reset_index(),
locations='state',
locationmode="USA-states",
scope='usa',
color='review_count',
color_continuous_scale="Viridis",
labels={'review_count':'Number reviews per State'}
)
fig.show()
With this geographic map we can see where the majority of reviews are written. If we want to focus on a specific state to evaluate the reviews, a good candidate would be PA, the state with the largest number of reviews in the entire United States, or more broadly the states on the East Coast, which concentrate most of the reviews.
We can also see that, among the values in the state column of the original dataset, some are not US state abbreviations. We are going to identify those two non-US values.
# Describe the location of the file
url_state_label= "C:\\Users\\DobleP\\Desktop\\proyecto_yield\\data\\us-states.json"
# Create the dataframe from the json file
state_label=pd.read_json(url_state_label, typ='series').to_frame().reset_index()
state_label.columns=['state','state_desc']
state_abbr=state_label['state'].to_list() # create a list with all the states abbreviation
# Create a list with all the unique values of state in the dataset
uni_states=df_business['state'].unique()
# Create new list that will save the values that are a US state abbreviation and the states that are not an abbreviation
no_state=[]
state_contain=[]
# Create a loop that goes through our unique values of state abbreviation and identify the values that are not an state
for elements in uni_states:
if elements in state_abbr:
state_contain.append(elements)
else:
no_state.append(elements)
# The states that are not part of the US are
no_state
['CA', 'MO', 'AZ', 'PA', 'TN', 'FL', 'IN', 'LA', 'AB', 'NV', 'ID', 'DE', 'IL', 'NJ', 'NC', 'CO', 'WA', 'HI', 'UT', 'TX', 'MT', 'MI', 'SD', 'XMS', 'MA', 'VI', 'VT']
Of the 27 state abbreviations in the dataset, two are not US states: AB is a province in Canada, and XMS corresponds to a location in England. For further analysis we can decide whether or not to focus only on the US states.
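If we choose to keep only the US businesses, here is a minimal sketch that drops the two non-US abbreviations identified above (df_business_us is a name of our own):
# Remove the businesses located outside the US (AB in Canada, XMS in England)
df_business_us = df_business[~df_business['state'].isin(['AB', 'XMS'])]
print(df_business_us['state'].nunique(), 'state abbreviations remain')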
c. Visualization of the Average star range by state
fig=px.choropleth(df_business[['state','stars']].groupby('state').mean().reset_index(),
locations='state',
locationmode="USA-states",
scope='usa',
color='stars',
color_continuous_scale="Viridis",
labels={'stars':'Average Star range per State'}
)
fig.show()
Here we visualize how the average star rating is distributed across the different US states.
a. Evolution of the number of reviews by month
review_series = df_reviews['date'].value_counts().sort_index() # counts per timestamp, sorted so resample works on an ordered index
review_series.resample('M').sum().plot()
In this chart we can see how the number of reviews placed on the Yelp platform evolves by month. With this information, we could derive insights to make future predictions about the number of reviews that will be placed on Yelp for specific categories.
From the chart we can see a decrease in the number of reviews in 2020, likely a result of the COVID pandemic; see the sketch below.
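To inspect that drop numerically rather than visually, we can print the monthly counts around the start of the pandemic; a minimal sketch (monthly_reviews is a name of our own):
# Monthly review counts around the start of the COVID pandemic
monthly_reviews = df_reviews.set_index('date').sort_index().resample('M').size()
print(monthly_reviews.loc['2019-12':'2020-06'])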
b. Distribution of the star ratings that users give in their reviews
plt.figure(figsize=(10,5))
ax = sns.countplot( x='stars' , data = df_reviews, palette='crest')
ax.set_ylabel('Count')
ax.set_xlabel('Stars')
ax.set_title('Distribution of Star Ratings in the Yelp Reviews')
for p in ax.patches:
width, height = p.get_width(), p.get_height()
x, y = p.get_xy()
ax.text(x+width-.4,
y+height,
'{:.0f}'.format(height),
horizontalalignment='center')
plt.show()
The star ratings do not follow a normal distribution: as we can see, the distribution is heavily skewed toward 5-star reviews.
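Since we later plan to predict the star rating from the review text, this imbalance matters for the model; a quick check of the class proportions:
# Share of each star rating among all reviews (relevant for the later classification model)
print(df_reviews['stars'].value_counts(normalize=True).sort_index())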
a. Visualize the distribution of the average stars given by users
plt.figure(figsize=(7,5))
ax = sns.histplot( x='average_stars' , data = df_user, palette='crest', binwidth=1)
ax.set_ylabel('Count')
ax.set_xlabel('Stars')
ax.set_title('Average stars assigned by users')
for p in ax.patches:
width, height = p.get_width(), p.get_height()
x, y = p.get_xy()
ax.text(x+width-.4,
y+height,
'{:.0f}'.format(height),
horizontalalignment='center')
plt.show()
From the plot we can see the distribution of the average stars assigned by all the users of the Yelp platform. The majority of users give a high average star rating to businesses, and only a small proportion of users give mostly bad reviews.
b. Correlation between selected variables
plt.figure(figsize=(10,8))
df_user_corr=df_user.copy()
df_user_corr=df_user_corr.drop(columns=['user_id','name','friends','yelping_since','elite'])
sns.heatmap(df_user_corr.corr())
With the correlation heatmap of the user dataframe we can start getting a sense of which variables are correlated with each other. This can later help us better understand the different classes of users in the Yelp dataset. Some variables are more correlated than others, and this gives an initial perspective on how the user variables relate to one another. It will be most helpful in the modeling stage, when clustering the users of the Yelp dataframe.
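To make the heatmap concrete, we can list the most correlated pairs of variables; a minimal sketch:
# List the ten most correlated pairs of user variables (upper triangle only, to avoid duplicates)
corr = df_user_corr.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.unstack().dropna().sort_values(ascending=False).head(10))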
From the dataset provided by the Yelp company we can create different models depending on the business insight or the question we want to answer from the data.
We are going to focus the modeling process on a ML algorithm that classifies review text to predict the star rating, and on an unsupervised ML algorithm to segment users; minimal sketches of both follow below.
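As an illustration of the supervised part, here is a minimal sketch of a baseline that predicts the star rating from the review text with TF-IDF features and logistic regression; the sample size, feature count, and model choice are placeholders of ours, not the final model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# Baseline text classifier trained on a sample of the reviews
sample = df_reviews.sample(100_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(sample['text'], sample['stars'],
                                                    test_size=0.2, random_state=42)
model = make_pipeline(TfidfVectorizer(max_features=20_000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print('Accuracy:', model.score(X_test, y_test))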
Since we are going to work with the user dataset, we will make some changes to the original dataset in order to extract the correct information from it.
# Create a copy of the primary user dataset and drop user_id and name, as this information is not needed
user_info=df_user.copy()
user_info=user_info.drop(['user_id','name'], axis=1)
# Create a function that counts all the values inside each list that are separated by a comma
def count_val(column_name):
num_elements=[]
for elements in user_info[column_name]:
if len(elements)==0:
num_elements.append(0)
else:
num_elements.append(len(elements.split(',')))
return num_elements
# Add the new columns to the existing user_info dataset; the original friends and elite columns are dropped afterwards
user_info['num_friends']=count_val('friends')
user_info['times_elite']=count_val('elite')
# Eliminate the original columns of friends and elite from the dataset
user_info=user_info.drop(['friends','elite'],axis=1)
# Keep only the year in which each user joined the platform
user_info['yelping_since']=pd.to_datetime(user_info['yelping_since'],format='%Y-%m-%d %H:%M:%S') #convert from obj to date
user_info['yelping_since']=user_info['yelping_since'].dt.strftime('%Y').astype('int') # extract only the year to analyse
# Save the reviews dataset to parquet
df_reviews['stars']=df_reviews['stars'].astype(float) # We do this to be able to save the dataset into a parquet
df_reviews.to_parquet("../data/reviews_data.parquet")
# Dataset of user_info
user_info['average_stars']=user_info['average_stars'].astype('float') # We do this to be able to save the dataset into a parquet
user_info.to_parquet("../data/user_info_data.parquet")
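Finally, a minimal sketch of the unsupervised part: clustering the cleaned user_info table with standardized features and K-Means. The number of clusters here is an arbitrary placeholder to be tuned later (for example with the elbow method):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Standardize the numeric user features and segment the users with K-Means
scaled = StandardScaler().fit_transform(user_info)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
user_info['cluster'] = kmeans.fit_predict(scaled)
print(user_info['cluster'].value_counts())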